This is a neat data analysis course.
Here is a link to my course repo: https://github.com/akaariala/IODS-project
Hmmm… why don’t I see changes in my course diary?
Oh, now I do!
The data contains 166 observations and 7 variables. The variables include the respondents' gender and age. In addition, the data includes a variable measuring the respondents' global attitude towards statistics, the sum of exam points, and variables measuring to what extent the respondents rely on deep, strategic, and surface approaches to learning.
The data has the following structure:
## Classes 'tbl_df', 'tbl' and 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
Figure 1 shows the distributions of the variables in the data and how the variables are associated with each other. Most respondents are women, and the mean age of the respondents is about 25.5 years. Looking at the distribution of all variables by gender shows limited differences between men and women. An exception is attitude towards statistics, where men report a more positive attitude than women.
Observing the correlations between the variables shows that attitude towards statistics is strongly positively correlated with exam results (the Points variable), suggesting that a positive attitude towards statistics goes together with better exam performance. Learning approaches are modestly or weakly correlated with exam results.
Learning approaches are generally rather modestly correlated with each other. However, there is one exception: as expected, the correlation between the deep and surface learning approaches is negative.
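Figure 1 is the kind of overview that `GGally::ggpairs` produces. As a hedged sketch — the course data frame is not reproduced in this diary, so the built-in `iris` data stands in for it, with `Species` playing the role of gender as the colouring variable:

```r
# Sketch only: iris stands in for the course data; Species stands in for gender.
library(GGally)
library(ggplot2)

p <- ggpairs(iris, mapping = aes(colour = Species, alpha = 0.3))
p  # distributions on the diagonal, pairwise plots and correlations elsewhere
```

With the real data, the same call with `colour = gender` would reproduce the gender-wise distributions discussed above.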
## gender Age Attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
I fitted a multiple linear regression model with exam points as the dependent variable and gender, age, and attitude towards statistics as the independent variables. Of the independent variables, only attitude towards statistics is associated with better exam points (estimate 0.36, p < 0.001).
The associations of gender and age with exam points are not statistically significant, suggesting no relationship between these predictors and the outcome when all three independent variables are controlled for.
The value of multiple R-squared is 0.20, which means that the model explains 20% of the variation in the dependent variable (i.e. exam points).
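The printed output below comes from a call of the form `lm(Points ~ gender + Age + Attitude, data = regr_analysis)`. Since the course data frame is not reproduced in this diary, here is a hedged sketch that fits the same formula on small simulated stand-in data:

```r
# Sketch only: simulated data mimics the structure of the course data.
set.seed(1)
demo <- data.frame(
  gender   = factor(sample(c("F", "M"), 166, replace = TRUE)),
  Age      = sample(17:55, 166, replace = TRUE),
  Attitude = sample(14:50, 166, replace = TRUE)
)
demo$Points <- 13 + 0.36 * demo$Attitude + rnorm(166, sd = 5)  # Attitude drives Points

model_demo <- lm(Points ~ gender + Age + Attitude, data = demo)
coef(summary(model_demo))  # estimates, standard errors, t values, p values
```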
##
## Call:
## lm(formula = Points ~ gender + Age + Attitude, data = regr_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## genderM -0.33054 0.91934 -0.360 0.720
## Age -0.07586 0.05367 -1.414 0.159
## Attitude 0.36066 0.05932 6.080 8.34e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
Removing the two statistically non-significant independent variables from the model above results in the model below. In this model, there is only one predictor: attitude towards statistics. Again, a more positive attitude is statistically significantly associated with better exam points.
The multiple R-squared of the latter model is 0.19, which means that the model explains 19% of the variation in the dependent variable (i.e. exam points). This is 0.01 lower than in the model above, indicating that the larger model explained a slightly bigger share of the variation in exam points.
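The R-squared values quoted here can be read straight from the model summaries. A hedged sketch, with the built-in `cars` data standing in for the exam data:

```r
# Sketch only: cars stands in for the exam data.
fit_full <- lm(dist ~ speed + I(speed^2), data = cars)  # stand-in "full" model
fit_red  <- lm(dist ~ speed, data = cars)               # stand-in "reduced" model
summary(fit_full)$r.squared  # at least as high as the nested model's R-squared
summary(fit_red)$r.squared
```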
##
## Call:
## lm(formula = Points ~ Attitude, data = regr_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The linear regression model assumes, first, that the association between the variables of interest is linear. Second, the errors between the predicted and observed values of the dependent variable should be normally distributed.
We can explore the normality assumption by looking at a QQ plot (Normal Q-Q in the figure below). The points follow the reference line reasonably well, indicating that the model is reasonably in line with the normality assumption.
An additional assumption in linear regression is that the variance of the errors is constant across the fitted values. An implication of this is that the size of the errors should not depend on the independent variables.
To assess this assumption, we can inspect a plot of residuals against predicted values (Residuals vs Fitted in the figure below). If the assumption holds, there should be no visible pattern in the plot. This seems to be the case here, as the points are quite randomly scattered.
Finally, I assessed how much impact any single observation has on the model by exploring the Residuals vs Leverage plot. No observations stand out from the others, suggesting that no single observation has an outsized impact on the model.
Taken together, these three diagnostic plots show no signs of problems with the modelling assumptions.
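The three diagnostic panels discussed above come from base R's `plot()` method for `lm` objects. A hedged sketch, with built-in data standing in for the exam-points model:

```r
# Sketch only: cars stands in for the exam data.
# which = 1 Residuals vs Fitted, 2 Normal Q-Q, 5 Residuals vs Leverage
fit <- lm(dist ~ speed, data = cars)
par(mfrow = c(1, 3))
plot(fit, which = c(1, 2, 5))
par(mfrow = c(1, 1))
```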
I am using data on student achievement in two Portuguese schools. The data contains records of student achievement and social and demographic background. In addition, the data contains variables on the students' alcohol consumption. It is therefore possible to analyse the relationship between alcohol consumption and other student characteristics, which is the purpose of this exercise. Below are the names of the variables in the data.
names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
In this exercise, I have chosen to analyze the following four variables in relation to students' alcohol consumption: sex, maternal education, extracurricular activities, and school achievement.
My hypothesis is that each of these variables is associated with alcohol consumption.
There are slightly more girls than boys in the data.
ggplot(alc2, aes(x = sex)) +
geom_bar() +
ggtitle("Sex of the students")
There are more mothers with high education (4) than with low education (0).
ggplot(alc2, aes(x = Medu)) +
geom_bar() +
ggtitle("Mother's education")
Somewhat more students participate in extracurricular activities than not.
ggplot(alc2, aes(x = activities)) +
geom_bar() +
ggtitle("Extracurricular activities")
The students' median alcohol use is 1.5, with 1 indicating very low and 5 very high alcohol use.
ggplot(alc2, aes(y = alc_use)) +
geom_boxplot() +
ylab("Alcohol use (1 = very low, 5 = very high)") +
ggtitle("Alcohol use")
Median school achievement is 11.
ggplot(alc2, aes(y = G3)) +
geom_boxplot() +
ylab("Achievement") +
ggtitle("School achievement")
Boys use more alcohol than girls, as hypothesized.
ggplot(alc2, aes(y = alc_use)) +
geom_boxplot() +
ylab("Alcohol use (1 = very low, 5 = very high)") +
ggtitle("Alcohol use by sex") +
facet_wrap(~ sex)
Contrary to my hypothesis, there are limited differences in alcohol use by maternal education.
ggplot(alc2, aes(y = alc_use)) +
geom_boxplot() +
ylab("Alcohol use (1 = very low, 5 = very high)") +
ggtitle("Alcohol use by maternal education") +
facet_wrap(~ Medu, nrow = 1)
Contrary to my hypothesis, those participating in extracurricular activities consume alcohol similarly to those who do not.
ggplot(alc2, aes(y = alc_use)) +
geom_boxplot() +
ylab("Alcohol use (1 = very low, 5 = very high)") +
ggtitle("Alcohol use by extracurricular activities") +
facet_wrap(~ activities, nrow = 1)
As suggested by the roughly horizontal direction of the blue line, and contrary to my hypothesis, there seems to be no relationship between alcohol consumption and school achievement; at least the relationship is not linear and clearly visible.
ggplot(alc2, aes(x = alc_use, y = G3)) +
geom_jitter(alpha = 0.3) +
geom_smooth() +
xlab("Alcohol use (1 = very low, 5 = very high)") +
ylab("School achievement") +
ggtitle("Alcohol use and school achievement")
The same seems to be the case also when plotting the association by sex.
ggplot(alc2, aes(x = alc_use, y = G3, color = sex)) +
geom_jitter(alpha = 0.5) +
geom_smooth(aes(x = jitter(alc_use))) +
xlab("Alcohol use (1 = very low, 5 = very high)") +
ylab("School achievement") +
ggtitle("Alcohol use and school achievement by sex")
When modelling the relationship between high alcohol consumption as a binary response and the other chosen variables with logistic regression, we observe that only sex has a statistically significant association with high alcohol consumption (p < 0.001) when all four variables are accounted for. Thus, boys use more alcohol than girls after adjusting for maternal education, extracurricular activities, and school achievement. Below, the relationships between the chosen variables and high alcohol consumption are presented as odds ratios, showing that boys have 2.6 (95% CI 1.7–4.2) times higher odds of high alcohol consumption than girls.
m <- glm(high_use ~ Medu + sex + activities + G3, data = alc2, family = "binomial")
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
summary(m)
##
## Call:
## glm(formula = high_use ~ Medu + sex + activities + G3, family = "binomial",
## data = alc2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2053 -0.8786 -0.6929 1.2661 1.9564
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.889149 0.389143 -2.285 0.0223 *
## Medu -0.006278 0.109342 -0.057 0.9542
## sexM 0.960712 0.237842 4.039 5.36e-05 ***
## activitiesyes -0.401774 0.234230 -1.715 0.0863 .
## G3 -0.026507 0.025003 -1.060 0.2891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 442.99 on 377 degrees of freedom
## AIC: 452.99
##
## Number of Fisher Scoring iterations: 4
round(cbind(OR, CI), digits = 2)
## OR 2.5 % 97.5 %
## (Intercept) 0.41 0.19 0.87
## Medu 0.99 0.80 1.23
## sexM 2.61 1.65 4.19
## activitiesyes 0.67 0.42 1.06
## G3 0.97 0.93 1.02
In this chapter, we study the Boston housing data from the R package MASS, which contains records on housing values in suburbs of Boston. The data includes observations from 506 suburbs and has 14 variables. Variable definitions can be found in the MASS documentation for the Boston data set.
str(boston)
## Classes 'tbl_df', 'tbl' and 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Below is a summary of the variables in the data. There are 12 numeric and 2 integer variables in the data set, and they all have different scale ranges.
summary(boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The instructions for this exercise suggested presenting a graphical overview of the data, describing the outputs, and commenting on the distributions of the variables and their relationships. With 14 variables, a pairwise graphical overview becomes very difficult to read; plotting the variables pairwise with the functions introduced in the course (i.e. GGally::ggpairs or pairs) leads to very small, hard-to-read output.
Therefore, a new function, corrplot::corrplot, is introduced. The correlation plot below shows that several of the 14 variables are positively or negatively correlated to varying degrees.
cor_matrix <- cor(boston)
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Because all the variables have different scales, let’s scale the data so that all the variables have mean value of 0 and standard deviation of 1.
boston_scaled <- as_tibble(scale(boston))
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Some data wrangling before moving on to linear discriminant analysis:
# Creating a categorical crime variable
bins <- quantile(boston_scaled$crim)
crim_labels <- c("low", "med_low", "med_high", "high")
boston_scaled <- boston_scaled %>%
mutate(crim =
cut(crim,
breaks = bins,
labels = crim_labels,
include.lowest = TRUE))
# splitting the data into training and test sets
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
Correct <- test$crim
test <- dplyr::select(test, -crim)
Fitting and printing the LDA model.
# fitting the model
lda.fit <- lda(crim ~ ., data = train)
lda.fit
## Call:
## lda(crim ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2425743 0.2673267 0.2425743
##
## Group means:
## zn indus chas nox rm
## low 0.9560042 -0.9925114 -0.11484506 -0.8951427 0.4616107
## med_low -0.1286911 -0.2730780 -0.07145661 -0.5655344 -0.1771826
## med_high -0.3744892 0.1551155 0.23803578 0.3715551 0.1576973
## high -0.4872402 1.0171960 -0.07145661 1.1004641 -0.3743028
## age dis rad tax ptratio
## low -0.8872437 0.9098856 -0.6867152 -0.7258173 -0.47000425
## med_low -0.3691889 0.3900125 -0.5435787 -0.4778358 -0.08551044
## med_high 0.4151904 -0.3673214 -0.4108279 -0.3079667 -0.33059436
## high 0.8238125 -0.8521101 1.6373367 1.5134896 0.77985517
## black lstat medv
## low 0.37911627 -0.7794448 0.5460625
## med_low 0.30947992 -0.1326050 -0.0350765
## med_high 0.08576956 -0.1136759 0.2323164
## high -0.81993068 0.8971110 -0.7187433
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.06889902 0.64594617 -0.92197631
## indus 0.06824219 -0.60470280 0.65479845
## chas -0.10924288 -0.04488765 -0.06631007
## nox 0.44813595 -0.60547294 -1.26056839
## rm -0.10608257 -0.04709483 -0.13386685
## age 0.21812268 -0.38211614 -0.27868556
## dis -0.03435048 -0.24902486 0.28373375
## rad 3.19896989 0.65076457 0.27868189
## tax -0.08013812 0.45468857 0.04007320
## ptratio 0.13745484 0.02729255 -0.33446077
## black -0.12243695 0.02123019 0.16824091
## lstat 0.21457826 -0.04351640 0.44366337
## medv 0.17830346 -0.26222597 -0.21387210
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9419 0.0422 0.0159
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "grey", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crim)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes, main ="LDA (bi)plot")
lda.arrows(lda.fit, myscale = 20)
Using the test data set to predict the classes with the LDA model.
The cross table below shows how well the LDA model predicts the crime rate in Boston suburbs. The model is very accurate for suburbs with a high crime rate. For suburbs with an actual crime rate from low to medium high, the model also predicts the crime rate fairly accurately, though not as accurately as for the high-crime suburbs.
# predicting classes for the test set
lda.predict <- predict(lda.fit, newdata = test)
# predicted values
Predicted <- lda.predict$class
# table of predicted vs. correct classes
descr::crosstab(Correct, Predicted, prop.r = T, plot = F)
## Cell Contents
## |-------------------------|
## | Count |
## | Row Percent |
## |-------------------------|
##
## =======================================================
## Predicted
## Correct low med_low med_high high Total
## -------------------------------------------------------
## low 15 9 3 0 27
## 55.6% 33.3% 11.1% 0.0% 26.5%
## -------------------------------------------------------
## med_low 7 15 6 0 28
## 25.0% 53.6% 21.4% 0.0% 27.5%
## -------------------------------------------------------
## med_high 0 5 12 1 18
## 0.0% 27.8% 66.7% 5.6% 17.6%
## -------------------------------------------------------
## high 0 0 0 29 29
## 0.0% 0.0% 0.0% 100.0% 28.4%
## -------------------------------------------------------
## Total 22 29 21 30 102
## =======================================================
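The overall accuracy implied by the cross table can be computed by hand from its counts (note that without a fixed random seed, the train/test split, and hence the exact counts, vary between knits):

```r
# Confusion matrix copied from the cross table above (rows = Correct, cols = Predicted)
conf <- matrix(c(15,  9,  3,  0,
                  7, 15,  6,  0,
                  0,  5, 12,  1,
                  0,  0,  0, 29),
               nrow = 4, byrow = TRUE,
               dimnames = list(Correct   = c("low", "med_low", "med_high", "high"),
                               Predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # share of correctly classified suburbs, about 0.70
```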
Below is a summary of the Euclidean distances between observations.
# the data
boston <- as_tibble(MASS::Boston)
boston_scaled <- as_tibble(scale(boston))
# distance matrix
dist_eu <- dist(boston_scaled, method = "euclidean")
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Here is the first round of k-means clustering, done with four clusters. For easier inspection, I only plotted the pairs of the first six variables in the data set.
km <- kmeans(boston_scaled, centers = 4)
# plot the Boston dataset with clusters
pairs(boston_scaled[1:6], col = km$cluster)
I investigated the optimal number of clusters by examining the total within-cluster sum of squares. After determining that the optimal number of clusters is two, I plotted the clusters again.
Interpretation: the clusters seem to differ across several variables. Looking at the crime variable, which has been the center of interest in this exercise, one cluster (black) includes the suburbs with higher crime rates, whereas the other cluster includes only low-crime suburbs.
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled[1:6], col = km$cluster)
Preparing the data.
model_predictors <- dplyr::select(train, -crim)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
The figure below plots the fitted LDA model in three dimensions.
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', colors = "Dark2")
Here is the same plot with the colors of the points representing crime rates of Boston suburbs.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train$crim, colors = "Dark2")
Lastly, I made the same plot but with the points colored by the two clusters found in the k-means exercise. Comparing the last two plots visually shows that k-means identified the suburbs (green dots) that had a high or medium-high crime rate in the previous plot. This serves as one way to validate the k-means cluster assignments.
km <- kmeans(boston_scaled[2:14], centers = 2)
clusters <- factor(km$cluster[ind])
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = clusters, colors = "Dark2")
setwd("~/GitHub/IODS-project/data")
library(tidyverse)
library(GGally)
library(corrplot)
human <- readRDS("human.rds")
This week I am analysing data from 155 countries. The data has eight variables on the economic, demographic, and gender equality-related development of the countries. Descriptive statistics of the variables are presented below.
summary(human)
## edu_sex_ratio labor_sex_ratio exp_education life_expectancy
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## gni maternal_mortality teen_births repr_w
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The correlation plot below shows that several of the variables are positively or negatively correlated. Education-, economy-, and health-related variables are positively correlated with each other. Only the female-to-male labor-force ratio and women's representation in parliament are not really correlated with the other variables or with each other.
corrplot(cor(human), method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
I will first experiment with PCA on unstandardized data. The results below show that the first component alone then captures virtually 100% of the variance. This is because PCA treats a variable with greater variance as more important in capturing the total variance in the data. Because gross national income (GNI) is on a completely different scale from the other variables, that variable alone captures nearly all of the variance. In the biplot (Fig 1), we see that the correlation between GNI and principal component 1 (PC1) is close to 1, suggesting a high importance of GNI in the model.
# pca with unstandardized data
pca_human1 <- prcomp(human)
summary(pca_human1)$importance %>% round(digits = 1)
##                            PC1   PC2  PC3  PC4 PC5 PC6 PC7 PC8
## Standard deviation     18544.2 185.5 25.2 11.5 3.8 1.6 0.2 0.2
## Proportion of Variance     1.0   0.0  0.0  0.0 0.0 0.0 0.0 0.0
## Cumulative Proportion      1.0   1.0  1.0  1.0 1.0 1.0 1.0 1.0
biplot(pca_human1, cex = c(0.6, 1), col = c("grey40", "deeppink2"), main = "Fig 1. Principal component analysis with unstandardized data.")
Fig 2 presents the results of PCA with standardized data. Comparing Fig 2 with Fig 1 shows the benefit of standardizing the data: the results are different, and the PCA now makes sense.
The first two principal components could be labelled "socioeconomic development" (PC1) and "gender equity" (PC2). The first principal component suggests that measures of a country's education level, health, and economic development are positively correlated with each other (or negatively, in the case of indicators of poor health, i.e. maternal mortality and teenage childbirths).
The second principal component suggests that the elements of gender equity, women's representation in parliament and the female-to-male ratio in working life, are positively correlated. However, the data suggests that socioeconomic development and gender equity are separate phenomena, and that women's education is related to better health, better education, and economic development rather than to gender equity. Women's education level alone is thus not necessarily a sufficient marker of gender equity.
# standarized data and pca with it
human_std <- scale(human)
pca_human2 <- prcomp(human_std)
# biplot axis titles
s <- summary(pca_human2)
pca_pr <- round(1*s$importance[2, ], digits = 2) * 100
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human2, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = "Fig 2. Principal component analysis with standardized data.")